Abstract

Background: Comparisons of patient experiences between providers are increasingly used as an index of 
performance. The present study describes the ability of patient experience surveys to discriminate between
healthcare providers for various patient groups and quality aspects, and reports the sample sizes required for 
reliable (comparisons of) provider scores. 

Method: The consumer quality index is a family of surveys that are tailored to specific patient groups. Data was 
used from patients who underwent cataract surgery, patients who underwent hip or knee surgery, patients 
suffering from spinal disc herniation and patients suffering from varicose veins. Multi-level regression models were 
fitted to assess the proportion of variance in patient experiences that is attributable to providers for various quality 
aspects. 

Results: The proportion of variance in patient experiences that is attributable to providers varied from 0.001 to 
0.054. The required sample size for reliable estimates at the provider level varied from 41 to 1967 per provider. 
Differences in discriminative power between patient groups and/or quality aspects were inconsistent, with one 
exception: for all groups, the discriminative power of experiences regarding change in physical functioning was 
particularly limited. 

Conclusions: From a statistical point of view, the discriminative power appears limited. The sample sizes required 
for reliable estimates are often substantial and deserve careful consideration when setting up measurements. 
Future research should evaluate the discriminative power by validating differences between providers in patient 
experiences with other indices and should explore other, more sensitive measures of patient experiences regarding 
treatment-related changes in physical functioning. 



Background 

It has been proposed that competition between healthcare
providers may increase the quality and cost-effectiveness
of healthcare [1]. For competition to emerge in
healthcare, the availability of comparative data on the 
performance of health care providers is considered 
essential [2,3]. One way to generate such data is to measure
patients' experiences of the care they received and
compare those experiences between providers [4,5]. 
Indeed, measurement of patient experiences is now a 
common strategy for monitoring healthcare provider 
performance in a number of countries and performance 
information is frequently made available to facilitate
consumer choice [5-12]. 

In the Netherlands, the Ministry of Healthcare pro- 
motes the consumer quality (CQ) index as the Dutch 
national standard for measuring patient experiences. 
The CQ-index is an instrument inspired by two other 
types of surveys: the American CAHPS (Consumer 
Assessment of Health care Providers and Systems) 
[4,13] and the Dutch QUOTE (QUality Of care Through 
the patients' Eyes) [14-17]. The CQ-index is character- 
ized by its disease-specific and provider-specific focus as 
well as the assessment of patient priorities, which are 
both derived from QUOTE. From CAHPS, the CQ- 
index adopted the layout, response scales and standar- 
dized sampling, data collection, analysis and presenta- 
tion. Similarly to both the CAHPS and QUOTE, the 
CQ-index focuses on patient experiences, rather than 
patient satisfaction. The underlying assumption is that
measures of actual experiences with the quality of care 
will be less subjective than evaluative measures of 
satisfaction. 

One of the main purposes of the CQ-index is to pro- 
vide performance indicators on quality of care from the 
patient perspective. During the development of a CQ- 
index, a consortium of stakeholders is formed, which 
typically includes governmental bodies, associations that 
represent healthcare providers, health insurance compa- 
nies and patient organizations [18]. This consortium is 
consulted and kept informed during the development of 
the survey to ensure the various stakeholders accept the 
resulting instrument and the indicators derived from the 
instrument. Such a consortium generally also organizes 
nationwide data collections for the measurement of 
these indicators. 

In the context of institutional performance, competi- 
tion and consumer information, the measurement and 
publication of indicators based on patient experiences is 
particularly informative if these indices show differences 
between providers. This is necessary when data is used 
in benchmarks in order to detect best practices and sti- 
mulate quality improvement. It is also of paramount 
importance if data are used as consumer information 
aimed at facilitating patient choice. After all, if there are 
no differences, there is not much to choose from. The 
discriminative power of the instrument must therefore 
be sufficient, but what is sufficient? Ideally, the discrimi- 
native power of a patient survey is at least enough to 
meet the following criteria: (1) the instrument detects 
significant differences between healthcare providers, and 
(2) the sample sizes required for reliable estimates at the 
provider level - and reliable comparison of those esti- 
mates - are available for each provider. There is of 
course a third criterion, namely that the differences 
detected between providers should reflect meaningful 
differences in care or service. The data available to the 
authors does not allow this criterion to be addressed, 
but this issue will be revisited in the discussion section. 
Naturally, criteria for the discriminative power of sur- 
veys should be met following the necessary adjustments 
for differences in case mix [19]. 

Future projects that seek to develop patient experience 
surveys may find empirical data that illustrates the dis- 
criminative power of such surveys in a variety of settings 
to be useful. Such data may guide expectations on the 
discriminative power of the survey under development, 
and may help choose the unit of analysis at which provi- 
ders are compared such that the expected number of 
respondents required per unit of analysis may be 
achieved. 

The means by which the discriminative power of a 
patient experience survey may be tested depends in part 



on the analytical strategy that is used. A common way 
to analyse data on health care provider performance 
that is widely recommended [19-21] and has also been 
adopted by the CQ-index [5,12,19] is multi-level model- 
ling. These models resemble more common analytical 
strategies such as analysis of variance or regression ana- 
lyses, with two important differences: (1) the multi-level 
model decomposes variance into that attributable to 
healthcare providers and that attributable to other 
sources such as individual differences, and (2) the multi- 
level model accounts for the fact that individuals within 
healthcare providers are not independent from one 
another [19,22]. As a general assessment of differences 
between providers, the variance attributable to providers 
can be tested for significance. The magnitude of the var- 
iance between healthcare providers may then be 
expressed as a proportion of the total variance on a 
scale from 0 to 1 (intra-class correlation coefficient; 
ICC). Additionally, comparisons between healthcare pro- 
viders can be made to determine whether a given 
healthcare provider differs significantly from any of the 
other healthcare providers. 
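
As a minimal illustration of the variance decomposition described above, the ICC can be computed from the two variance components of a two-level model. The function name and the variance values below are hypothetical and serve only as an example; they are not taken from the data of the present study.

```python
def intraclass_correlation(var_between: float, var_within: float) -> float:
    """Proportion of total variance located at the provider level (0 to 1)."""
    total = var_between + var_within
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_between / total


# Hypothetical variance components, chosen only for illustration.
icc = intraclass_correlation(var_between=0.02, var_within=0.48)
print(f"ICC = {icc:.3f}")
```

With these illustrative values, 4% of the variance in experiences would be attributable to providers and the remaining 96% to individual differences and other sources.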

Several studies have reported the ICC's for patient 
experience surveys. For example, Stubbe et al. reported 
that the ICC's for cataract surgery varied from not significant
(nurses' communication) to .03 (ophthalmologist's
communication) [12]. In another study, the ICC's
for hip or knee surgery were reported to vary from not 
significant (communication about medication, pain con- 
trol, global rating of hospital) to .03 (doctor's communi- 
cation, nurse's communication) [5]. Furthermore, 
Damman et al. reported that the ICC's for health plans 
varied from .02 (health plan information) to .05 (global 
rating) [19]. In addition, Zaslavsky et al. [23] reported
the percentage of variance in experiences that was 
explained by health plan and a number of geographical 
variables. For the vast majority of quality aspects, the 
variance explained varied from 0.4% to 6.0%, which cor- 
responds to the ICC's reported in the aforementioned 
studies. Further, Hargraves et al. [4] reported the num- 
ber of respondents required for reliable estimates of per- 
formance scores per health plan, which is also indicative 
of the magnitude of differences between providers, as 
fewer observations are required when differences are 
large. For global ratings, the required number of respon- 
dents varied from 49 (global rating of health plan) to 
287 (global rating of specialist). For composite measures, 
the required number of respondents varied from 64 
(getting the care that was needed) to 169 (doctors who 
communicate). Although it was concluded that the plan- 
level reliability was impressive, it is also worth noting 
that with response rates varying from 24 percent to 57 
percent between plans, sample sizes should exceed 500 
for most plans to obtain the required number of 
respondents for reliable estimates at the provider level
for both the global ratings and the composite measures. 
Solomon et al. [24] reported on a survey to evaluate the
performance of medical groups. The required sample 
size for reliable scores at the medical group level was 
reported to vary from 52 (access to care) to 1340 (preventive
counselling). Finally, Keller et al. [25] also
reported the reliability of performance scores of compo- 
site measures at the hospital level. They assumed a 
response of 300 per hospital and most reliabilities 
appeared satisfactory, ranging from 0.66 (medicine com- 
munication) to 0.89 (nurse communication; responsive- 
ness). To sum up, although studies do point in broadly 
the same direction, there are differences between studies 
regarding ICC's or regarding the required number of 
respondents for a satisfactory unit-level reliability. It is
therefore of interest to examine what drives these differences.

The Consumer Quality Index (CQ-index) is a family 
of surveys for measuring the patient perspective that 
allows us to examine the magnitude and reliability of 
differences between health care providers in various 
patient groups for various quality aspects. In the present 
study, we seek to describe the discriminative power of 
CQ-index surveys for several quality aspects in various settings.
Data was used from patients suffering from varicose
veins, patients who underwent hip or knee surgery,
patients who underwent cataract surgery and patients 
suffering from spinal disc herniation. 

The following research questions will be addressed: 

1. What is the discriminative power of the patient 
surveys at issue? 

2. Does the discriminative power of patient surveys 
vary across different measures and/or patient 
groups? 

3. What sample sizes are required for reliable esti- 
mates of provider scores? 

Methods 

Participants 

All data was collected in the Netherlands using self- 
administered surveys. Patients were identified through 
insurance companies and/or hospitals and approached 
by mail on up to four occasions: an initial questionnaire 
accompanied by a letter, a thank-you/reminder note one
week later, a reminder mailing for non-respondents that 
consisted of the questionnaire and a letter another three 
weeks later and a final reminder letter for non-respon- 
dents another two weeks later. The dataset for patients 
who underwent hip or knee surgery consisted of 1514 
patients from 43 hospitals (response = 75.0%), the data- 
set for patients who underwent cataract surgery con- 
sisted of 4126 patients from 55 hospitals (response = 
71.7%), the dataset for varicose veins consisted of 2195 



participants from 20 hospitals (response = 61.5%) and 
the dataset for spinal disc herniation contained 1648 
patients from 20 hospitals (response = 42.3%). The 
number of observations per provider varies within and 
between the datasets used, but since the present paper 
does not report estimates for individual providers this 
presents no major limitation. Data on the demographic 
characteristics (age, self-observed health, education and 
gender) is presented in Table 1. 

The studies in which the data was collected were per- 
formed in accordance with the Declaration of Helsinki. 
Research by means of surveys that are not taxing and/or 
hazardous for patients is not subject to the Dutch Medi- 
cal Research Involving Human Subjects Act (WMO). 
Accordingly, ethical approval was not required. All sur- 
veys were accompanied by instructions including a 
statement that participation is voluntary and
anonymous.

Selection of patient experiences 

For the purposes of the present study, we selected 
experiences with patient-doctor communication and 
experiences regarding the effect of treatment in terms of 
changes in physical functioning as, for these experiences, 
composite measures could be calculated for each survey. 
The items underlying composite scores for patient-doc- 
tor communication are presented in Table 2, along with 
their internal consistency (Cronbach's coefficient alpha: 
0.81 - 0.92). The response categories for these items 
were: never-sometimes-usually-always. The items vary 
somewhat between surveys, as surveys are developed in 
separate projects, each with a separate consortium of 
stakeholders that is consulted for decisions on the con- 
tent of questionnaires. Furthermore, composite scores 
were calculated for the extent to which relevant ele- 
ments of physical functioning were improved as com- 
pared to the start of treatment. For all surveys, the 
response categories regarding physical functioning were 
"worse-similar-better" than before treatment. For the 
survey for patients that underwent a cataract surgery, 
items underlying this composite score contained 12 
items (Cronbach's coefficient alpha = 0.90) covering 
issues such as being able to see things from a close dis- 
tance or far away, being able to cope with bright lights, 
being able to drive etc. For the survey on hip or knee 
surgery, the composite score also consisted of 12 items 
(Cronbach's coefficient alpha = 0.95) and covered issues 
such as stair climbing, pain, standing, walking etc. In 
the case of varicose veins, this composite entailed 9 
items (Cronbach's coefficient alpha = 0.91) and covered 
issues such as feelings of fatigue in the legs, pain, stand- 
ing, physical appearance etc. For spinal disc herniation, 
the composite contained 22 items (Cronbach's coeffi- 
cient alpha = 0.94) and covered issues such as stair 



Table 1 Demographic characteristics of the patient populations

Values are n (%).

                    Hip or knee    Varicose       Cataract       Spinal disc
                    surgery        veins          surgery        herniation
Age
  18-44             29 (2%)        640 (29%)      34 (1%)        493 (30%)
  45-64             413 (27%)      1212 (55%)     628 (16%)      825 (50%)
  65+               1072 (71%)     343 (16%)      3374 (84%)     329 (20%)
Education
  Low               962 (64%)      645 (29%)      2551 (63%)     562 (34%)
  Medium            397 (26%)      948 (43%)      1009 (25%)     667 (40%)
  High              155 (10%)      602 (27%)      476 (12%)      419 (25%)
General health
  Poor              12 (1%)        19 (1%)        96 (2%)        69 (4%)
  Fair              196 (13%)      278 (13%)      1143 (28%)     504 (31%)
  Good              785 (52%)      1371 (62%)     2045 (51%)     818 (50%)
  Very good         260 (17%)      362 (16%)      473 (12%)      184 (11%)
  Excellent         261 (17%)      165 (8%)       279 (7%)       73 (4%)
Gender
  Male              432 (29%)      394 (18%)      1522 (38%)     845 (51%)
  Female            1082 (71%)     1801 (82%)     2514 (62%)     802 (49%)



de Boer et al. BMC Health Services Research 2011, 11:332
http://www.biomedcentral.com/1472-6963/11/332






climbing, standing up, walking, back pain, mobility etc. 
Finally, each survey contained a global rating of care 
and a question addressing the extent to which a patient 
would recommend his or her healthcare provider to 
family and friends; both were included in the analyses 
for the present paper. 

Data analyses 

The discriminative power of the surveys at issue was 
assessed using multi-level modelling. For all surveys, the 
models included two levels: the individual and the 
healthcare provider. The healthcare provider is the hos- 
pital or hospital department rather than an individual 
doctor, as reporting quality scores for individual doctors 
is a heavily debated issue in the Netherlands with regard 
to privacy legislation. In addition, it is unlikely that 
healthcare providers would cooperate with quality mea- 
surements if results would be reported per individual 
doctor. 

We first fitted a series of empty models and calculated 
the intra-class correlation coefficient (ICC). The ICC 
reflects the proportion of total variance that is attributed 
to between-provider differences and is used as a general 
measure of discriminatory power. Subsequently, we 
accounted for the variables age, education and self-rated 
health, which are commonly identified as case mix 
adjusters and evaluated the impact of this case mix 
adjustment by its effect on the ICC. In the case of 
experienced change in physical functioning, self-rated 
health was not included in the case-mix-adjusted model 
as it is plausible that patients who experience no change 
or worsening of their physical functioning would also 
rate their own health as lower compared to patients 
whose physical functioning improved. Accordingly, 
adjustment for self-rated health would remove real dif- 
ferences in experienced change in physical functioning. 
Further, the range in which 95% of the providers' means 
are expected to occur was determined as the average 
across all provider means plus or minus two standard 
deviations (SD), where the SD is calculated as the square 
root of the variance at the provider level. The required 



number of respondents to achieve a reliability at the 
provider level of 0.70 or 0.80 was also calculated [22,
p. 59]. In contrast to the reliability indicated by Cronbach's
coefficient alpha - where items of the same composite
are expected to agree within individuals as they
measure the same construct - the provider level reliabil- 
ity is based on the theory that patients treated by the 
same provider should agree in their assessments of that 
provider. If agreement between patients from the same 
provider is limited, more respondents are required to 
achieve a reliable estimate of the performance of that 
provider. 
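
The relation between the ICC and the required number of respondents can be sketched as follows. The sketch assumes that the provider-level reliability is the standard reliability of an aggregated group mean, lambda = n * ICC / (1 + (n - 1) * ICC), solved for n; that this is the exact formula applied in the analyses is an assumption, and the function names are hypothetical. The ICC of 0.054 is the largest value reported in the Results.

```python
def provider_reliability(icc: float, n: float) -> float:
    """Reliability of a provider mean based on n respondents (assumed formula)."""
    return n * icc / (1 + (n - 1) * icc)


def required_respondents(icc: float, target: float) -> float:
    """Respondents per provider needed to reach the target reliability."""
    if not 0 < icc < 1 or not 0 < target < 1:
        raise ValueError("icc and target must lie strictly between 0 and 1")
    return target * (1 - icc) / (icc * (1 - target))


for target in (0.70, 0.80):
    n = required_respondents(0.054, target)
    check = provider_reliability(0.054, n)
    print(f"target {target:.2f}: about {round(n)} respondents (reliability {check:.2f})")
```

Under these assumptions an ICC of 0.054 corresponds to roughly 41 respondents per provider for a reliability of 0.70 and roughly 70 for a reliability of 0.80. Because the required number is inversely related to the ICC, it grows rapidly as the ICC approaches zero.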

Results 

The ICC's for the empty and the corrected models are 
presented in Table 3. As can be seen in Table 3, the 
corrected models generally display a reduced ICC com- 
pared to the empty models, suggesting that some of the 
differences between healthcare providers that are 
observed in the empty model may be explained by dif- 
ferences in their populations on the case mix adjusters. 
This phenomenon was least pronounced for the global 
rating (see Table 3). 

Focussing on the adjusted model - which is arguably 
the model of choice [26,27] - it can be observed that the 
ICC varies from 0.001 (change in physical functioning; 
cataract surgery) to 0.054 (global rating; varicose veins). 
In a number of cases, the variance at the level of the 
healthcare provider was not statistically significant. This 
was particularly the case for change in physical func- 
tioning: the variance at the level of healthcare providers 
was significant only for varicose veins. Further, variances 
at the level of healthcare providers were not significant 
for doctors' communication in spinal disc herniation, 
the global rating for both spinal disc herniation and hip 
or knee surgery and recommendation to others for hip 
or knee surgery (see Table 3). In sum, the extent to 
which differences in experiences between individuals are 
attributable to their healthcare providers appears limited 
and the variance observed at the level of healthcare pro- 
viders is often not significant. 



Table 2 The items that underlie the composite doctor's communication for the various patient groups 
Table 3 The discriminative power of patient experience surveys for different patient groups and quality aspects in unadjusted and adjusted models,
accompanied by the sample sizes required to detect differences between providers reliably.

Columns, in order:
  Empty model: variance providers (a), variance individuals (b), ICC
  Adjusted model: variance providers (a), variance individuals (b), ICC
  Mean, SD
  95% expected range of provider scores: lower limit, upper limit, range
  Required sample size per provider: reliability = .70, reliability = .80
  Required number of patients to be approached per provider (c): reliability = .70, reliability = .80

Doctor's communication
  Hip or knee surgery (n1 = 43; n2 = 1462) (d)
  Varicose veins (n1 = 20; n2 = 2189)
  Cataract surgery (n1 = 55; n2 = 4021)
  Spinal disc herniation (n1 = 20; n2 = 1574)

Change in physical functioning
  Hip or knee surgery (n1 = 43; n2 = 1345)
  Varicose veins (n1 = 20; n2 = 1663)
  Cataract surgery (n1 = 55; n2 = 2982)
  Spinal disc herniation (n1 = 20; n2 = 1592)

Global rating
  Hip or knee surgery (n1 = 43; n2 = 1496)
  Varicose veins (n1 = 20; n2 = 2169)
  Cataract surgery (n1 = 55; n2 = 3967)
  Spinal disc herniation (n1 = 20; n2 = 1590)

Recommendation to others
  Hip or knee surgery (n1 = 43; n2 = 1497)
  Varicose veins (n1 = 20; n2 = 2154)
  Cataract surgery (n1 = 55; n2 = 4003)
  Spinal disc herniation (n1 = 20; n2 = 1564):
    empty model 0.0160, 0.5157, 0.030; adjusted model 0.0146, 0.4986, 0.028;
    mean 3.16, SD 0.12; expected range 2.92, 3.40, 0.48;
    required sample size 80, 137; patients to be approached 188, 323

(a) Variances in bold are significant (p < .05)
(b) The significance of variances at the level of individuals is not reported
(c) Derived from the required sample size and the response rate (hip or knee surgery (75%), varicose veins (62%), cataract surgery (72%), spinal disc herniation (42%))
(d) n1 denotes the number of healthcare providers, n2 denotes the total number of patients
To further examine the extent to which the surveys at
issue are able to distinguish between health care provi- 
ders, we also calculated the range in which 95% of the 
provider means are expected to occur, given the var- 
iance at the level of healthcare providers (see Table 3). 
For the two variables that consisted of items containing 
four response categories, the range varied from 0.16 to 
0.40 (doctor's communication) and 0.32 to 0.52 (recom- 
mendation to others). For the global rating, the range 
varied from 0.59 to 1.21 points and for changes in phy- 
sical functioning, which consisted of items containing 
three response categories, the expected range varied 
from 0.05 to 0.26. It is worth noting that, although the 
global rating was the measure that discriminated best 
between providers only for varicose veins, the expected 
range of provider means was the largest across all 
patient groups. 
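
The expected range can be reproduced from the mean of the provider scores and the provider-level variance, as described in the Methods. The sketch below uses the Table 3 values for recommendation to others in spinal disc herniation (mean 3.16, adjusted provider-level variance 0.0146); the function name is hypothetical.

```python
import math


def expected_provider_range(mean: float, var_between: float, k: float = 2.0):
    """Return (lower, upper, width) of the interval mean +/- k provider-level SDs."""
    sd = math.sqrt(var_between)
    lower = mean - k * sd
    upper = mean + k * sd
    return lower, upper, upper - lower


# Recommendation to others, spinal disc herniation (values from Table 3).
lower, upper, width = expected_provider_range(mean=3.16, var_between=0.0146)
print(f"{lower:.2f} to {upper:.2f} (width {width:.2f})")
```

On a four-point response scale, an interval of about 2.92 to 3.40 implies that nearly all providers are expected to score within roughly half a scale point of one another.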

In addition, the number of observations per provider 
for reliable estimates of healthcare provider scores and, 
accordingly, meaningful comparison of provider scores, 
was calculated for a reliability of 0.70 and a reliability of 
0.80. Subsequently, the number of participants that 
should be approached to achieve the required number 
of observations, given the observed response rate was 
assessed. In cases where the discriminative power was 
small (ICC < 0.01), required sample sizes per healthcare 
provider were large (569 - 2516) for a reliability of 0.70 
and excessive (975 - 4313) for a reliability of 0.80. For 
the other measures, the required sample size varied 
from 41 to 193 for a reliability of 0.70 and from 70 to 
331 for a reliability of 0.80 (see Table 3). The number of 
participants that should be approached for reliable esti- 
mates at the provider level is dictated by the required 
sample size and the expected response rate. In the last 
two columns of Table 3, the number of patients that 
should be approached is presented, again for a reliability 
of 0.70 and a reliability of 0.80. Obviously, the number 
of patients that should be approached is higher than the 
required sample size in all cases. The magnitude of the 
difference between the two is determined by the 
response rate: in case of spinal disc herniation (response 
rate = 42%), the number of patients that should be 
approached is more than twice the required sample size 
whereas in case of hip or knee surgery (response rate =
75%) the number of patients to be approached is only about one third
higher (see Table 3).
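
The effect of the response rate can be illustrated with a short calculation. The sketch assumes that the number of patients to approach is simply the required sample size divided by the response rate; the function name and the requirement of 100 respondents are hypothetical, while the response rates are those reported in the Methods.

```python
def patients_to_approach(required_n: float, response_rate: float) -> float:
    """Patients to invite so that, on average, required_n of them respond."""
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    return required_n / response_rate


# Hypothetical requirement of 100 respondents per provider.
for label, rate in (("hip or knee surgery", 0.750),
                    ("spinal disc herniation", 0.423)):
    invited = patients_to_approach(100, rate)
    print(f"{label}: factor {1 / rate:.2f}, approach about {round(invited)}")
```

At a response rate of 75% the inflation factor is about 1.33, whereas at 42.3% it is about 2.36, which is why the low response rate for spinal disc herniation more than doubles the number of patients to be approached.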

Discussion 

The present study showed that the extent to which 
patient experiences are dependent on differences 
between providers is limited. The extent to which 
patient experiences are determined by provider differ- 
ences varied from 0.001 to 0.054, which means that 
0.1% to 5.4% of the variance in patient experiences may 



be attributed to health care providers. Accounting for 
common case mix adjusters generally reduced the extent 
to which patient experiences are attributable to provi- 
ders. Further, differences in discriminative power 
between patient groups and/or measures were inconsis- 
tent, with one exception: for all patient groups the dis- 
criminative power of experiences regarding change in 
physical functioning was particularly limited. As 
expected, the required number of patients to approach 
per provider was exceptionally large in cases where the 
discriminative power of a measure was low and response 
rates were low. 

The discriminative power of the various patient 
experience surveys as presented here is largely consis- 
tent with previous reports [4,5,12,19]. However, where it 
may be difficult to evaluate the parallels between pre- 
vious reports, as the experiences reported varied and the 
methodology used was not always consistent, the pre- 
sent study provides a comprehensive overview for differ- 
ent patient groups using corresponding measures for 
patient experiences and identical methods for data 
analyses. 

Whether the reported levels of discriminative power 
should be considered meaningful remains a matter of 
debate. It may be argued that the extent to which 
patient experiences are attributable to healthcare provi- 
ders is low and that the range in which 95% of provider 
scores are expected to occur is rather narrow. On the 
other hand, empirical data on the discriminative power 
of a wide variety of measures in primary care - including 
measures such as the short form 36 and the hospital 
anxiety and depression score, as well as blood pressure 
and cholesterol - showed that the median ICC is 0.01 
when looking at models without covariates and 0.005 
for models including covariates [28]. These values are 
exceeded by most of the measures of patient experiences 
presented in the present paper. It may be questioned 
however, whether the discriminative power can be eval- 
uated by statistical parameters alone. Ideally, the differ- 
ences between providers revealed by patient experience 
surveys should be considered in the context of data on 
other measures on the same quality aspects that are 
independent of patient experiences. When evaluating 
the discriminative power of patient experience surveys 
regarding doctor's communication for example, it would 
be helpful to know how independent observers would 
rate the communication skills of a doctor at the lower 
versus the higher end of the range. Such information 
would illustrate the meaning of differences in patient 
experiences between providers. 

The discriminative power of patient experiences varied 
between measures and surveys. One consistent trend 
that was observed was the limited discriminative power 
of experiences regarding changes in relevant elements of 
physical functioning following treatment. Admittedly,
the development of such measures as indices of health- 
care provider performance is far from complete. 
Accordingly, it is possible that providers do differ in 
terms of the experienced change in physical functioning, 
but that the retrospective measures used to assess these 
differences in the present paper are not sufficiently sen- 
sitive. In this context, it should be acknowledged that 
measures of changes in physical functioning have been 
successfully used to compare the effects of various 
healthcare interventions, albeit in a different format 
using pre- and post measurements [29,30]. Such a strat- 
egy would also allow a more advanced case-mix adjust- 
ment as the pre measurement may be used to account 
for differences in baseline health status. However, the 
use of pre- and post measurements in the context of 
continuous nationwide monitoring of patient experi- 
ences would substantially increase costs and respondent 
burden. Therefore, the CQ-index initially attempted to 
incorporate assessment of experiences regarding change 
in physical functioning in a single measurement. Nevertheless,
since the present strategy failed to demonstrate
differences between providers, future attempts to adopt 
measures of experienced change in physical functioning 
as indices of provider performance should consider 
alternative strategies including those containing pre- 
and post measurements [30]. 

The present paper also reported the required number 
of patients to be approached for reliable estimates at the 
provider level, and accordingly for meaningful compari- 
son of provider scores. The number of patients that 
should be approached is dependent on two things: the 
discriminative power of the survey and the response 
rate. The present paper showed that the number of 
patients to be approached is often well in excess of 100, 
and may even reach thousands should a comparison be 
desired between providers for measures of patient 
experiences where these differences between providers 
are small. In our experience, the number of patients to 
be approached per provider is a heavily debated issue 
among researchers and stakeholders when setting up 
measurements. On the one hand, it is appealing to keep 
down the number of patients to be approached to 
reduce costs and to prevent exclusion of small provi- 
ders. On the other hand, larger numbers of patients 
allow more reliable estimates of provider scores and per- 
mit more and better distinctions between providers. For 
measures where the required number of patients to be 
approached for reliable estimates at the provider level is 
excessive due to a lack of differences between providers, 
we recommend that stakeholders consider whether such 
measures are useful for benchmarking purposes. It is 
unlikely of course that a benchmark would distinguish 
between providers in such cases, but on occasion it may 



be useful to illustrate that for some elements of care it 
does not matter which provider is chosen. 

Practical dilemmas arise when the number of patients 
to be approached for reliable estimates of provider 
scores is not excessive in itself, but can still not be 
achieved by most providers e.g. because the type of care 
at issue is delivered by small providers that only treat a 
limited number of patients a year. In such cases, strate- 
gies to increase the number of patients that can be 
approached per healthcare provider are of interest. For 
example, where results normally reflect patient experi- 
ences in the preceding year to ensure recent and up-to- 
date figures, this period may be lengthened. In addition, 
small providers are sometimes part of a larger organiza- 
tion. If there is sufficient uniformity of care provision 
within this organization, it may be possible to choose 
the unit of analysis at the level of the organization, 
rather than at the level of the providers underlying the 
organization. 

It should be noted that increasing the number of 
patients to be approached does not resolve issues of 
generalisability of results when the response rate is low. 
Nonetheless, on the assumption that causes of non-response 
are broadly similar between providers and/or 
that possible response bias may be addressed through 
case mix adjustment, it may still be interesting to compare 
the experiences of respondents between providers. 
In this context it may be useful to adjust the 
number of patients to be approached such that there 
will be sufficient observations for comparing providers. 

Several limitations deserve consideration when interpreting 
the present findings. First, the variance at the 
level of providers depends partly on the heterogeneity 
of the sample of providers. A more heterogeneous 
sample of providers would result in a larger 
variance at the level of providers, an increased ICC and 
a reduced number of patients to be approached. 
Whether the heterogeneity of the sample of providers is 
representative of the heterogeneity of all providers is dif- 
ficult to determine. In addition, the heterogeneity of 
providers may vary between countries and/or health 
care systems. Nonetheless, it should be noted that the 
ICCs reported in the present article are broadly similar 
to those reported elsewhere [5,12,19,23], suggesting that 
more accurate estimates of the variances would be 
unlikely to lead to fundamentally different results. 
Second, it is possible that the 
variance at the level of individuals is under- or overestimated 
as a result of measurement error, which is an 
often ignored source of variance. Accordingly, it remains 
essential to develop surveys that are reliable, valid and 
sensitive. Third, the level of the health care provider 
consisted of hospitals rather than individual doctors or 
nurses since reporting quality scores on individual 
health care providing staff is still a matter of debate in 
the Netherlands. Nevertheless, it is possible that differences 
between individual doctors or nurses are larger 
than differences between hospitals or hospital departments, 
because assessing differences between individual nurses 
or doctors presents a more specific measurement. 
Indeed, evidence on patient reports of individual doctors 
showed a wider range of ICCs, varying from 0.02 to 
0.17 [31]. Thus, although reporting quality scores on 
individual doctors appears to be a sensitive issue, it is 
certainly appealing from a methodological point of view. 
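
The first limitation above rests on the definition of the ICC as the share of the total variance that is located at the provider level. A minimal sketch with made-up variance components (these are not estimates from the present study, and the function name is hypothetical) shows how a more heterogeneous sample of providers raises the ICC while the variance between patients stays nearly the same:

```python
def intraclass_correlation(provider_variance: float,
                           patient_variance: float) -> float:
    """Proportion of total variance attributable to providers in a
    two-level model (patients nested within providers)."""
    total = provider_variance + patient_variance
    if provider_variance < 0 or patient_variance < 0 or total == 0:
        raise ValueError("variances must be non-negative and not both zero")
    return provider_variance / total


# Illustrative (made-up) variance components.
homogeneous = intraclass_correlation(1.0, 99.0)    # providers alike
heterogeneous = intraclass_correlation(5.0, 95.0)  # providers differ more

if __name__ == "__main__":
    print(homogeneous, heterogeneous)
```

A higher ICC in turn lowers the number of respondents needed for a reliable provider score, which is why the composition of the provider sample matters for the sample sizes reported here.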

Conclusions 

In conclusion, the discriminative power of patient 
experience surveys remains an important issue in the 
development of indices of healthcare provider perfor- 
mance. The present paper showed that the discrimina- 
tive power of patient experience surveys is generally 
limited, but for most patient groups several measures 
provided sufficient discriminative power to allow reliable 
estimates of provider scores and, accordingly, meaning- 
ful comparisons of provider scores using sample sizes 
that can be achieved by most providers. In particular, 
differences between providers were small for items 
focusing on changes in physical functioning as indices 
of healthcare provider performance. Future research 
should explore other strategies for measuring patient 
experiences regarding change in physical functioning, 
with the aim of identifying more sensitive measurement strategies. 
Other studies and projects may also benefit from 
overviews such as those given in the present paper 
when setting up data collection and determining the 
level of aggregation at which comparisons between 
healthcare providers are performed. 